Batched fetching #251
Conversation
Overall this looks solid.
```ruby
def get_instance(kind, resource_name)
  if @cache.key?(kind)
    @cache[kind].find { |r| r.dig("metadata", "name") == resource_name } || {}
```
What do you think about making the cache a 2D lookup by kind and then name? (e.g. `@cache[kind][resource_name] || {}`)
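A sketch of that shape, assuming the cache is rebuilt once per polling cycle; `populate_cache` and `fetched_instances` are hypothetical names, not the real SyncMediator API:

```ruby
# Sketch: index instances by kind, then by metadata.name, at cache-build time.
# `fetched_instances` stands in for the parsed "items" array from
# `kubectl get <kind> --output=json`.
def populate_cache(kind, fetched_instances)
  @cache ||= {}
  @cache[kind] = fetched_instances.each_with_object({}) do |instance, by_name|
    by_name[instance.dig("metadata", "name")] = instance
  end
end

# Lookup becomes a constant-time hash access instead of a linear scan.
def get_instance(kind, resource_name)
  return {} unless @cache && @cache.key?(kind)
  @cache[kind][resource_name] || {}
end
```

The preprocessing cost is paid once per kind per cycle, while every `get_instance` call afterwards is O(1).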
I think it's a good idea; preprocessing it into that form up front should in theory be more efficient overall.
I changed the SyncMediator tests to be unit tests that use a set of fake resource classes and stub kubectl. The details of the cache behaviour are what we really need to cover IMO (nearly all the existing integration tests cover the critical path of the class with real resources), and I was having a hard time making the assertions I wanted about when API calls are/aren't made. The unit tests are obviously also a lot faster. Let me know what you think. I fixed a couple of bugs I found in the process, and I think this is now ready for a detailed look. 🙏
```ruby
def sync(mediator)
  super
  @latest_rs = exists? ? find_latest_rs(mediator) : nil
  @server_version = mediator.kubectl.server_version
```
It's ugly, but we could store this as a class-level var to prevent every deployment from having to make the API call. Might not be worth the optimization.
Sorry, I must have missed this comment. It wouldn't be safe to store it at the class level, since someone could be using this as a proper gem and running separate `DeployTask` instances in parallel against different servers. However, we could at least cache it here and only make one call per deployment instance.
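A per-instance memoization along those lines might look like this sketch, where `version_fetcher` is a hypothetical stand-in for the real API call so the example stays self-contained:

```ruby
# Sketch of per-instance (not class-level) caching of the server version.
class Kubectl
  def initialize(version_fetcher)
    @version_fetcher = version_fetcher # hypothetical: wraps the real API call
  end

  # The underlying call happens at most once per instance, so parallel
  # DeployTasks pointed at different servers still each get their own value.
  def server_version
    @server_version ||= @version_fetcher.call
  end
end
```

Because the memo lives on the instance rather than the class, two tasks talking to different clusters can never see each other's cached version.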
Couple of minor things, but looks good.
```ruby
def fetch_by_kind(kind)
  raw_json, _, st = kubectl.run("get", kind, "-a", "--output=json")
  return unless st.success?
```
Should we emit a warning if the call to kubectl fails?
Since both this cache and the polling loop overall are built to handle transient errors gracefully, I think that'd just add noise (that's the reasoning for it not being enabled in the equivalent place today anyway). If you turn on debug-level logging, it will get recorded by `kubectl.run` itself.
test/unit/sync_mediator_test.rb
```ruby
  assert_equal({}, missing)
end

def test_get_instance_does_not_populate_the_cache
```
This test doesn't look finished, and it also looks like it's covered by the first test.
You're right, it pretty much is. I'll merge them and move the note.
Force-pushed from 370bfc0 to 19000ba
I have found a couple of nits, but nothing major.
I was wondering whether there is a reason why the code sometimes uses `dig(a, b, c)` and in other places `[a][b][c]`. Any particular reason?
Are we okay with only having tests for deployments and pods?
test/unit/sync_mediator_test.rb
```ruby
stub_kubectl_response('get', 'FakeConfigMap', *@params, success: false, resp: { "items" => [] }, err: 'no').times(2)
stub_kubectl_response('get', 'FakeConfigMap', @fake_cm.name, *@params, resp: @fake_cm.kubectl_response, times: 1)
assert_equal [], mediator.get_all('FakeConfigMap')
assert_equal [], mediator.get_all('FakeConfigMap', "fake" => "false", "type" => "fakeconfigmap")
```
Does this test more than the line before, i.e. that failed fetches don't get cached both with a selector and without?
Yeah, admittedly it's unlikely we'd introduce a behavioural difference there, but I thought it couldn't hurt. I can remove it if you want.
Nope, totally fine; it just took me a second to figure out why you added the test. Maybe add a comment to it so that no one else has to wonder why.
```
@@ -1,33 +0,0 @@
# frozen_string_literal: true
```
Out of curiosity: why was this deleted?
This is a Shopify custom resource that we 🔥 in production months ago. It's effectively dead code, so I deleted rather than updated it.
```ruby
@status = if @deployment_exists && @service_exists
def status
  if deployment_ready? && service_ready?
```
Could use `deploy_succeeded?` here as well.
```ruby
@proxy_service = mediator.get_instance(Service.kind, "cloudsql-#{@name}")
end

def status
```
This is the same as `deploy_succeeded?`, so it could be refactored into a ternary using that function.
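A minimal sketch of that ternary refactor; the class name and readiness flags here are hypothetical stand-ins for the real custom resource class:

```ruby
# Sketch: reuse deploy_succeeded? instead of duplicating the readiness checks.
class FakeCloudsql
  def initialize(deployment_ready, service_ready)
    @deployment_ready = deployment_ready
    @service_ready = service_ready
  end

  def deploy_succeeded?
    @deployment_ready && @service_ready
  end

  # The ternary keeps status from drifting out of sync with deploy_succeeded?.
  def status
    deploy_succeeded? ? "Provisioned" : "Unknown"
  end
end
```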
```ruby
end

def status
  if deployment_ready? && service_ready? && configmap_ready?
    "Provisioned"
  else
    "Unknown"
```
Is `deploy_succeeded?` dead code now? https://github.com/Shopify/kubernetes-deploy/pull/251/files#diff-8abb8dfa4044255bd0d4793a0bdfdefcR23
That link isn't working for me, but that seems like a mistake. Good catch!
We were originally using an older version of ruby on this project, and I have a vague memory of updating cases like
Since we're so reliant on correct interactions with multiple versions of k8s, this project has for the most part focused on maintaining excellent integration coverage. We typically add unit tests when there are edge cases that are difficult/impossible to get integration coverage for. If you see any code that isn't covered, please call it out and I'll add coverage one way or another.

One exception is the classes for Shopify's custom resources, which are virtually uncovered. The work to make them dynamic will give us generic coverage for them, but until then it is difficult, since our test clusters don't run the backing controllers. Typically I deploy my test app locally as a 🎩 before merging any PRs that touch those classes. That definitely sucks. Maybe now's the time to change it... we could have the test itself inject dummy resources into the namespace to at least cover the basics. See also: codecov
Force-pushed from 19000ba to a732963
Updated and rebased.
(not reviewing but pretty excited about this)
🍠
I did a self-review to re-familiarize myself with the code before shipping and found a couple of minor things and a bug:

- The fetch_events/logs methods previously used a kubectl instance that did not log failure by default, and I forgot to switch them to specify `log_failure: false` at the command level now that they are passed a variable kubectl instance.
- Fix in response to Danny's comment about caching the server version.
- The service refactor had two errors in it that weren't caught by tests, so I fixed them and added a unit test suite for that class. Incidentally, that class's logic is incomplete and flawed when you consider the full array of service types possible; we should revisit it.
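For context, the `log_failure: false` pattern mentioned above looks roughly like this sketch; the real `Kubectl#run` signature and logger wiring in the project differ:

```ruby
# Sketch of per-command failure logging instead of per-instance logging.
class Kubectl
  def initialize(logger)
    @logger = logger # e.g. a lambda wrapping the real logger
  end

  # Callers that expect transient failures (like fetch_events/fetch_logs in
  # the polling loop) pass log_failure: false at the command level.
  def run(*args, log_failure: true)
    success = execute(args)
    @logger.call("The following command failed: kubectl #{args.join(' ')}") if !success && log_failure
    success
  end

  private

  # Stand-in that always fails, for illustration only.
  def execute(_args)
    false
  end
end
```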
🎩 with a test app with shopify CRs successful
Problem
Every polling loop, including the one to check the initial status, makes an API call for every single resource in the set. When you've got over a thousand resources, that takes a long time, even with concurrency!
Solution
Make one request per resource type in the set instead, with a fallback if there's a cache miss for a type a resource instance needs.
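The idea can be sketched like this (the collaborator methods `get_list`/`get_one` are hypothetical; the real SyncMediator differs):

```ruby
# Sketch of batched fetching: one list call per resource type per polling
# cycle, memoized in a cache, with a single-instance fallback on cache miss.
class SyncMediator
  def initialize(kubectl)
    @kubectl = kubectl
    @cache = {}
  end

  # One API call per kind per cycle, however many instances of it exist.
  def get_all(kind)
    @cache[kind] ||= @kubectl.get_list(kind)
  end

  # Served from the cache when the kind was already fetched; otherwise fall
  # back to fetching just the one instance rather than the whole type.
  def get_instance(kind, name)
    if @cache.key?(kind)
      @cache[kind].find { |r| r.dig("metadata", "name") == name } || {}
    else
      @kubectl.get_one(kind, name)
    end
  end
end
```

This is why the per-resource cost drops: N resources of one kind cost one list call instead of N gets.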
We tested this in our canary environment and saw the "checking initial status" step decrease from 24s to 6s. The `apply` + initial polling loop step also decreased from 44s to 24s (probably because `apply --prune` is still slow). I would expect this time to grow somewhat with larger resource sets (raw `kubectl get` is slower on larger clusters), but no longer linearly with the number of resources.

A nice side-effect of this design is that we can remove the `kubectl` access from the instances entirely, preventing people from using it when they shouldn't be (i.e. outside polling loops).

Reviewers, please especially look for any concurrency gotchas we might have here. This new cache is effectively shared state in a multi-threaded environment. Should we freeze some of it, for example?
Still to do